Abstract. The sample average approximation approach to solving stochastic programs
induces  a  sampling  error,  caused  by  replacing  an  expectation  by  a  sample  average,  as  well 
as  an  optimization  error  due  to  approximating  the  solution  of  the  resulting  sample  average 
problem.  We  obtain  an  estimator  of  the  optimal  value  of  the  original  stochastic  program  after 
executing  a  finite  number  of  iterations  of  an  optimization  algorithm  applied  to  the  sample 
average  problem.  We  examine  the  convergence  rate  of  the  estimator  as  the  computing  budget 
tends  to  infinity,  and  characterize  the  allocation  policies  that  maximize  the  convergence  rate 
in the case of sublinear, linear, and superlinear convergence regimes for the optimization
algorithm. 

1  Introduction 

Sample  average  approximation  (SAA)  is  a  frequently  used  approach  to  solving  stochastic 
programs  where  an  expectation  of  a  random  function  in  the  objective  function  is  replaced 
by  a  sample  average  obtained  by  Monte  Carlo  sampling.  The  approach  is  appealing  due 
to  its  simplicity  and  the  fact  that  a  large  number  of  standard  optimization  algorithms  are 
often  available  to  optimize  the  resulting  sample  average  problem.  It  is  well  known  that 
under  relatively  mild  assumptions  global  and  local  minimizers  as  well  as  stationary  points 
of  the  sample  average  problem  and  the  corresponding  objective  function  values  tend  to  the 
corresponding  points  and  values  of  the  stochastic  program  almost  surely  as  the  sample  size 
increases  to  infinity.  The  asymptotic  distribution  of  minimizers,  minimum  values,  and  related 
quantities  for  the  sample  average  problem  are  also  known  under  additional  assumptions.  We 
refer  to  Chapter  5  of  [30]  for  a  comprehensive  presentation  of  results  and  [32,  29]  for  recent 
advances. 
In view of the prevalence of uncertainty in planning problems, stochastic programs
are  formulated  and  solved  by  the  SAA  approach  in  a  broad  range  of  applications  such  as 
stochastic  vehicle  allocation  and  routing  [21,  31,  20],  electric  power  system  planning  [21,  20], 
telecommunication  network  design  [21,  20],  financial  planning  [28,  1,  32],  inventory  control 
[32],  mixed  logit  estimation  models  [4],  search  theory  [29],  and  engineering  design  [27,  29]. 

A  main  difficulty  with  the  approach  concerns  the  selection  of  an  appropriate  sample 
size.  At  one  end,  a  large  sample  size  provides  small  discrepancy  in  some  sense  between  the 
stochastic  program  and  the  sample  average  problem,  but  results  in  a  high  computational 
cost  as  objective  function  and  (sub)gradient  evaluations  in  the  sample  average  problem 
involve  the  averaging  of  a  large  number  of  quantities.  At  the  other  end,  a  small  sample 
size  is  computationally  inexpensive  as  the  objective  function  and  (sub)gradient  evaluations 
in  the  sample  average  problem  can  be  computed  quickly,  but  yields  poor  accuracy  as  the 
sample  average  only  coarsely  approximates  the  expectation.  It  is  usually  difficult  to  select  a 
sample  size  that  balances  accuracy  and  computational  cost  without  extensive  trial  and  error. 
This  paper  examines  different  policies  for  sample-size  selection  given  a  particular  computing 
budget. 

The  issue  of  sample-size  selection  arises  in  most  applications  of  the  SAA  approach. 
In  this  paper,  however,  we  focus  on  stochastic  programs  where  the  corresponding  sample 
average  problems  are  solvable  by  a  deterministic  optimization  algorithm  with  known  rate 
of convergence such as in the case of subgradient, gradient, and Newtonian methods. This
situation  includes,  for  example,  two-stage  stochastic  programs  with  continuous  first-stage 
variables  and  a  convex  recourse  function  [15],  conditional  value-at-risk  models  [28,  32],  and 
programs  with  convex  smooth  random  functions.  We  do  not  deal  with  integer  restrictions, 
which  usually  imply  that  the  sample  average  problem  is  solvable  in  finite  time,  and  random 
functions  whose  evaluation,  or  that  of  its  subgradient,  gradient,  and  Hessian  (when  needed), 
is  difficult  due  to  an  unknown  probability  distribution  or  other  complications.  We  also  do 
not  deal  with  chance  constraints,  i.e.,  situations  where  the  feasible  region  is  given  in  terms 
of random functions; see for example Chapter 4 of [30]. We observe that there are several
other  approaches  to  solving  stochastic  programs  (see  for  example  [10,  14,  13,  17,  2,  3,  16,  24, 
22]).  However,  this  paper  deals  with  the  SAA  approach  exclusively. 

There appear to be only a few studies dealing with the issue of determining a computationally
efficient sample size within the SAA approach. Sections 5.3.1 and 5.3.2 of [30]
provide estimates of the required sample size to guarantee that a set of near-optimal solutions
of the sample average problem is contained in a set of near-optimal solutions of the
stochastic  program  with  a  given  confidence.  While  these  results  provide  useful  insights  about 
the  complexity  of  solving  the  stochastic  program,  the  sample-size  estimates  are  typically  too 
conservative  for  practical  use.  The  authors  of  [5]  efficiently  estimate  the  quality  of  a  given 
sequence of candidate solutions by Monte Carlo sampling using heuristically derived rules for
selecting  sample  sizes,  but  do  not  deal  with  the  sample  size  needed  to  generate  the  candidate 
solutions. 

In  the  context  of  a  variable  SAA  approach,  where  not  only  one,  but  a  sequence  of 
sample  average  problems  are  solved  with  increasing  sample  size,  [26]  constructs  open-loop 
sample-size  control  policies  using  a  discrete-time  optimal  control  model.  That  study  deals 
with  linearly  convergent  optimization  algorithms  and  cannot  guarantee  that  the  sample-size 
selections  are  optimal  in  some  sense.  However,  the  resulting  sample-size  control  policies 
appear  to  lead  to  substantial  computational  savings  over  alternative  selections. 

The  recent  paper  [25]  also  deals  with  a  variable  SAA  approach.  It  defines  classes  of 
“optimal  sample  sizes”  that  best  balance,  in  some  asymptotic  sense,  the  sampling  error  due 
to  the  difference  between  the  stochastic  program  and  the  sample  average  problem  with  the 
optimization  error  caused  by  approximate  solution  of  the  sample  average  problems  by  an 
optimization  algorithm.  If  the  rate  of  convergence  of  the  optimization  algorithm  is  high, 
the  optimization  error  will  be  small  relative  to  that  generated  by  an  optimization  algorithm 
with slower rate for a given computing budget. The paper [25] gives specific guidance on how to
select sample sizes tailored to optimization algorithms with sublinear, linear, and superlinear
rates of convergence.

The  simulation  and  simulation  optimization  literature  (see  [7]  for  a  review)  deals  with 
how to optimally allocate effort across different tasks within the simulation given a specific
computing  budget.  The  allocation  may  be  between  exploration  of  different  designs  and 
estimation  of  objective  function  values  at  specific  designs  as  in  global  optimization  [9,  12], 
between  estimation  of  different  random  variables  nested  by  conditioning  [19],  or  between 
estimation  of  different  expected  system  performances  in  ranking  and  selection  [8].  These 
studies  typically  define  an  optimal  allocation  as  one  that  makes  the  estimator  mean-squared 
error  vanish  at  the  fastest  possible  rate  as  the  computing  budget  tends  to  infinity. 

The present paper is related to these studies from the simulation and simulation optimization
literature, and in particular the recent paper [25]. As in [25], we consider optimization
algorithms with sublinear, linear, and superlinear rate of convergence for the solution
of the sample average problem. However, we adopt more specific assumptions regarding
these rates than in [25] and consider errors in objective function values instead of solutions,
which allows us to avoid the potentially restrictive assumption about uniqueness of optimal
solutions. Our assumptions are satisfied by standard optimization algorithms such as many
subgradient, gradient, and Newtonian methods and allow us to develop refined results regarding
the effect of various sample-size selection policies. For algorithms with a sublinear
rate of convergence with optimization error of order $n^{-p}$, where $n$ is the number of iterations,
we examine the effect of the parameter $p > 0$. For linear algorithms with optimization
error of order $\theta^n$, we study the influence of the rate of convergence coefficient $\theta \in (0,1)$.
For superlinear algorithms with optimization error of order $\theta^{\psi^n}$, the focus is on the power
$\psi > 1$ and also the secondary effect due to $\theta > 0$. We determine the rate of convergence of
the  SAA  approach  as  the  computing  budget  tends  to  infinity,  accounting  for  both  sampling 
and  optimization  errors.  Specifically,  we  view  the  value  obtained  after  a  finite  number  of 
iterations  of  an  optimization  algorithm  as  applied  to  a  sample  average  problem  with  a  finite 
sample  size  as  an  estimator  of  the  optimal  value  of  the  original  stochastic  program,  and 
examine  the  convergence  rate  of  the  estimator  as  the  computing  budget  tends  to  infinity. 
To  our  knowledge,  there  has  been  no  systematic  study  of  this  estimator,  its  convergence 
rate,  and  the  influence  of  various  sample-size  selection  policies  on  the  rate.  We  determine 
optimal policies in a sense described below that lead to rates of convergence of order $c^{-\nu}$ for
$0 < \nu < 1/2$, $(c/\log c)^{-1/2}$, and $(c/\log\log c)^{-1/2}$ as the computing budget $c$ tends to infinity
for sublinear, linear, and superlinear optimization algorithms, respectively. In the linear case,
we also determine a policy with rate of convergence similar to the sublinear case that is
robust to parameter misspecification.

The  paper  is  organized  as  follows.  The  next  section  presents  the  stochastic  program,  the 
associated sample average problem, as well as underlying assumptions. Sections 3 to 5 consider
the cases with sublinear, linear, and superlinear rate of convergence of the optimization
algorithm,  respectively.  Section  6  presents  numerical  examples  illustrating  the  sample-size 
selection  policies. 

2  Problem  Statement  and  Assumptions 

We consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\Omega \subset \mathbb{R}^k$, a nonempty compact subset
$X \subset \mathbb{R}^d$, and the function $f : X \to \mathbb{R}$ defined by

$$f(x) = \mathbb{E}[F(x,\omega)],$$

where $\mathbb{E}$ denotes the expectation with respect to $\mathbb{P}$ and $F : \mathbb{R}^d \times \Omega \to \mathbb{R}$ is a random
function. The following assumption, which ensures that $f(\cdot)$ is well-defined and finite valued
as well as other properties, is used throughout the paper.

Assumption  1  We  assume  that 

(i) the expectation $\mathbb{E}[F(x,\omega)^2] < \infty$ for all $x \in X$ and

(ii) there exists a measurable function $C : \Omega \to \mathbb{R}_+$ such that $\mathbb{E}[C(\omega)^2] < \infty$ and

$$|F(x,\omega) - F(x',\omega)| \le C(\omega)\|x - x'\|$$

for all $x, x' \in X$ and almost every $\omega \in \Omega$.
In view of Theorems 7.43 and 7.44 in [30], $f(\cdot)$ is well-defined, finite-valued, and Lipschitz
continuous  on  X.  We  observe  that  weaker  assumptions  suffice  for  these  properties  to  hold; 
see  [30],  pp.  368-369.  However,  in  this  paper  we  utilize  a  central  limit  theorem  and  therefore 
adopt  these  light-tail  assumptions  from  the  beginning  for  simplicity  of  presentation. 

We  consider  the  stochastic  program 

$$\mathbf{P}: \ \min_{x \in X} f(x),$$

which from the continuity of $f(\cdot)$ and compactness of $X$ has a finite optimal value denoted
by $f^*$. We denote the set of optimal solutions of $\mathbf{P}$ by $X^*$.

In  general,  f(x)  cannot  be  computed  exactly,  and  we  approximate  it  using  a  sample 
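average. We let $\bar\Omega = \Omega \times \Omega \times \cdots$ be the sample space corresponding to an infinite sequence
of sample points and let $\bar{\mathbb{P}}$ be the probability distribution on $\bar\Omega$ generated by $\mathbb{P}$ under
independent sampling. We denote subelements of $\bar\omega \in \bar\Omega$ by $\omega^j \in \Omega$, $j = 1, 2, \ldots$, i.e.,
$\bar\omega = (\omega^1, \omega^2, \ldots)$. Then, for $m \in \mathbb{N} = \{1, 2, 3, \ldots\}$, we define the sample average
$f_m : \mathbb{R}^d \times \bar\Omega \to \mathbb{R}$ by

$$f_m(x, \bar\omega) = \sum_{j=1}^m F(x, \omega^j)/m.$$

Various sample sizes give rise to a family of approximations of $\mathbf{P}$. Let $\{\mathbf{P}_m(\bar\omega)\}_{m \in \mathbb{N}}$ be
this family, where, for any $m \in \mathbb{N}$, the sample average problem $\mathbf{P}_m(\bar\omega)$ is defined by

$$\mathbf{P}_m(\bar\omega): \ \min_{x \in X} f_m(x, \bar\omega).$$

Under Assumption 1 (and also under weaker assumptions), $f_m(\cdot, \bar\omega)$ is Lipschitz continuous
on $X$ for almost every $\bar\omega \in \bar\Omega$. Hence, $\mathbf{P}_m(\bar\omega)$ has a finite optimal value for almost every
$\bar\omega \in \bar\Omega$, which we denote by $f_m^*(\bar\omega)$.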
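
As a rough illustration of how a sample average problem is formed, the following Python sketch uses a hypothetical random function $F(x,\omega) = (x-\omega)^2$ on $X = [-1, 1]$ with normally distributed $\omega$. The function names, the distribution, and the closed-form solver are assumptions made for this example only and are not taken from the paper.

```python
import random

def F(x, omega):
    """Hypothetical random function F(x, omega) = (x - omega)^2."""
    return (x - omega) ** 2

def sample_average(x, sample):
    """f_m(x, omega_bar): average of F(x, omega^j) over the m sample points."""
    return sum(F(x, w) for w in sample) / len(sample)

def solve_saa(sample, lo=-1.0, hi=1.0):
    """Exact minimizer of f_m over X = [lo, hi] for this particular F:
    the sample mean projected onto the interval."""
    mean = sum(sample) / len(sample)
    return min(max(mean, lo), hi)

rng = random.Random(0)                             # fixed seed for reproducibility
m = 1000                                           # sample size
sample = [rng.gauss(0.3, 1.0) for _ in range(m)]   # omega^1, ..., omega^m
x_m = solve_saa(sample)                            # an optimal solution of P_m
f_m_star = sample_average(x_m, sample)             # optimal value f_m^*
```

For this special quadratic $F$ the sample average problem is solvable in closed form, so there is no optimization error; the general case treated in the paper requires an iterative algorithm.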

The SAA approach consists of selecting a sample size $m$, generating a sample $\bar\omega$, and
then approximately solving $\mathbf{P}_m(\bar\omega)$ using an appropriate optimization algorithm. (In practice,
this process may be repeated several times, possibly with variable sample size, to facilitate
validation analysis of the obtained solutions and to reduce the overall computing time; see
for example Section 5.6 in [30] and [29]. However, in this paper, we focus on a single
replication.) A finite sample size induces a sampling error $f_m^*(\bar\omega) - f^*$, which typically is
nonzero. However, as the sample size $m \to \infty$, the sampling error vanishes in some sense as
the following proposition states, where $N(\mu, \sigma^2)$ stands for a normal random variable with
mean $\mu$ and variance $\sigma^2$, $\sigma^2(x)$ for $\mathrm{Var}[F(x,\omega)]$, and $\Rightarrow$ for weak convergence.

Proposition  1  If  Assumption  1  holds,  then 

$$m^{1/2}(f_m^*(\bar\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)),$$

as $m \to \infty$.
Proof: The result follows directly from Theorem 5.7 in [30] as Assumption 1 implies the
assumption  of  that  theorem.  □ 

Unless $\mathbf{P}_m(\bar\omega)$ possesses a special structure such as in the case of a linear or quadratic
program, it cannot be solved in finite computing time. Hence, the SAA approach is also
associated with an optimization error. Given a deterministic optimization algorithm, let
$A_m^n(x, \bar\omega)$ be the solution obtained after $n \in \mathbb{N}$ iterations of that optimization algorithm,
starting from $x \in X$, as applied to $\mathbf{P}_m(\bar\omega)$. We assume that $A_m^n(x, \bar\omega)$ is a random vector
for any $n, m \in \mathbb{N}$ and $x \in X$, with $A_m^0(x, \bar\omega) = x$. The optimization error is then
$f_m(A_m^n(x, \bar\omega), \bar\omega) - f_m^*(\bar\omega)$. If the optimization algorithm converges to a globally optimal
solution of $\mathbf{P}_m(\bar\omega)$, then the optimization error vanishes as $n \to \infty$. However, the rate with
which it vanishes depends on the rate of convergence of the optimization algorithm.

In  this  paper,  we  examine  the  trade-off  between  sampling  and  optimization  errors  in 
the  SAA  approach  within  a  given  computing  budget  c.  A  large  sample  size  ensures  a  small 
sampling  error,  but,  due  to  the  computing  budget,  restricts  the  number  of  iterations  of  the 
optimization  algorithm  causing  a  potentially  large  optimization  error.  Similarly,  a  large 
number  of  iterations  may  result  in  a  large  sampling  error.  Naturally,  therefore,  the  choice 
of  sample  size  and  number  of  iterations  could  depend  on  the  computing  budget  and  we 
sometimes write $m(c)$ and $n(c)$ to indicate this dependence. We refer to $\{(n(c), m(c))\}_{c \in \mathbb{N}}$,
with $n(c), m(c) \in \mathbb{N}$ for all $c \in \mathbb{N}$, and $n(c), m(c) \to \infty$, as $c \to \infty$, as an allocation policy.
An allocation policy specifies the number of iterations and sample size to adopt for a given
computing budget $c$. We observe that the focus on unbounded sequences for both $n(c)$ and
$m(c)$ is not restrictive as we are interested in situations where an infinite number of iterations
and an infinite sample size are required to ensure that both the optimization and sampling errors vanish.

The specifics of the trade-off between sampling and optimization errors depend on the
computational  effort  needed  to  carry  out  n  iterations  of  the  optimization  algorithm  as  a 
function  of  m.  We  adopt  the  following  assumption. 

Assumption 2 For any $n, m \in \mathbb{N}$, $x \in X$, and $\bar\omega \in \bar\Omega$, the computational effort to obtain
$A_m^n(x, \bar\omega)$ is $nm$.

Assumption 2 is reasonable in view of the fact that each function, (sub)gradient, and Hessian
evaluation of the optimization algorithm when applied to $\mathbf{P}_m(\bar\omega)$ requires the summation of
$m$ quantities. Hence, the effort per iteration would be proportional to $m$. This linear
growth in $m$ has also been observed empirically; see, e.g., p. 204 in [30]. Assuming that
each iteration involves approximately the same number of operations, which is the case for
single-point algorithms such as the subgradient, steepest descent, and Newton's methods, the
computational effort to carry out $n$ iterations would be proportional to $nm$. We observe that
we could replace $nm$ by $\gamma nm$, where $\gamma$ is a constant, in Assumption 2. However, this simply
amounts to a rescaling of the computing budget and has no influence on the subsequent
analysis. An allocation policy $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ that satisfies $n(c)m(c)/c \to 1$ as $c \to \infty$ is
asymptotically admissible. Hence, an asymptotically admissible policy will, at least in the
limit as $c$ tends to infinity, satisfy the computing budget.
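
The admissibility condition $n(c)m(c)/c \to 1$ can be checked numerically for a candidate rule. The sketch below assumes a hypothetical helper `policy` that rounds a rule $n(c)$ to an integer and spends the remaining budget on the sample size; this rounding scheme is a choice made here for illustration, not one prescribed by the paper.

```python
import math

def policy(c, n_of_c):
    """Return (n, m) for budget c: n from the rule n_of_c (rounded),
    m = floor(c / n), both forced to be at least 1."""
    n = max(1, int(round(n_of_c(c))))
    m = max(1, c // n)
    return n, m

def admissibility_ratio(c, n_of_c):
    """The ratio n(c) * m(c) / c, which should approach 1 as c grows."""
    n, m = policy(c, n_of_c)
    return n * m / c

# Example rule n(c) = c^{1/2}; the ratio stays close to 1 for large budgets.
ratios = [admissibility_ratio(c, math.sqrt) for c in (10**4, 10**6 + 7, 10**8)]
```

Because $m$ is obtained by integer division, $nm$ never exceeds $c$ (as long as $n \le c$) and falls short of it by less than $n$, so the ratio tends to 1 whenever $n(c)/c \to 0$.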

The two kinds of errors, due to sampling and optimization, contribute to the mean-squared
error $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x, \bar\omega), \bar\omega)) = \mathbb{E}[(f_{m(c)}(A_{m(c)}^{n(c)}(x, \bar\omega), \bar\omega) - f^*)^2]$. In view of the
discussion above, the MSE vanishes, in some sense, under mild assumptions as $c$ tends to
infinity. Of course, there is a large number of asymptotically admissible allocation policies,
and ensuing convergence rates. In the next three sections we analyze the estimator convergence
rate under the assumption of sublinear, linear, and superlinear rate of convergence of
the optimization algorithm.

We use the following standard ordering notation in the remainder of the paper. A
sequence of random variables $\{\xi_n\}_{n \in \mathbb{N}}$ is $O_p(1)$ if for all $\epsilon > 0$ there exists a constant $C$
such that $P(|\xi_n| > C) < \epsilon$ for all $n$ sufficiently large. Similarly, the sequence is $o_p(1)$ if for
$\epsilon > 0$ arbitrary $P(|\xi_n| > \epsilon) \to 0$, as $n \to \infty$. A deterministic sequence $\{\xi_n\}_{n \in \mathbb{N}}$ is $O(a_n)$,
for $\{a_n\}_{n \in \mathbb{N}}$ a positive sequence, if $|\xi_n|/a_n$ is bounded by a finite constant. We also write
$\xi_n \sim a_n$ if $\xi_n/a_n$ tends to a finite constant as $n \to \infty$.

3  Sublinearly  Convergent  Optimization  Algorithm 

Suppose that the deterministic optimization algorithm used to solve $\mathbf{P}_m(\bar\omega)$ converges sublinearly
as stated in the next assumption.

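Assumption 3 There exists a $p > 0$ and a family of measurable functions $K_m : \bar\Omega \to \mathbb{R}_+$,
$m \in \mathbb{N}$, such that $K_m(\bar\omega) \Rightarrow K \in [0, \infty)$, as $m \to \infty$, and

$$f_m(A_m^n(x, \bar\omega), \bar\omega) - f_m^*(\bar\omega) \le \frac{K_m(\bar\omega)}{n^p}$$

for all $x \in X$, $n, m \in \mathbb{N}$, and almost every $\bar\omega \in \bar\Omega$.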

Several standard algorithms satisfy Assumption 3 when $\mathbf{P}_m(\bar\omega)$ is convex. For example, the
subgradient method satisfies Assumption 3 with $p = 1/2$ and $K_m(\bar\omega) = D_X \sum_{j=1}^m C(\omega^j)/m$,
where $C(\omega)$ is as in Assumption 1 and $D_X = \max_{x, x' \in X} \|x - x'\|$; see [23], pp. 142-143. When
$F(\cdot, \omega)$ is Lipschitz continuously differentiable, Nesterov's optimal gradient method satisfies
Assumption 3 with $p = 2$ and $K_m(\bar\omega)$ proportional to the product of the average Lipschitz
constant of $\nabla_x F(\cdot, \omega)$ on $X$ and $D_X^2$; see p. 77 of [23].
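
To make the $p = 1/2$ case concrete, the following Python sketch runs a textbook projected subgradient method with a constant step size on a hypothetical sample average problem with $F(x,\omega) = |x - \omega|$ and $X = [-1, 1]$, so that $C(\omega) = 1$ and $D_X = 2$. The step-size rule, the test function, and all names are assumptions of this example; it is one common variant of the subgradient method and not necessarily the one analyzed in [23].

```python
import math
import random

def f_m(x, sample):
    """Sample average of the hypothetical F(x, w) = |x - w|."""
    return sum(abs(x - w) for w in sample) / len(sample)

def subgradient(x, sample):
    """One subgradient of f_m at x: the average of sign(x - w_j)."""
    return sum((x > w) - (x < w) for w in sample) / len(sample)

def projected_subgradient(sample, n, x0=-1.0, lo=-1.0, hi=1.0):
    """Run n projected subgradient steps with a constant step size and
    return the best objective value seen (including the starting point)."""
    diameter = hi - lo       # D_X for the interval X = [lo, hi]
    lipschitz = 1.0          # C(w) = 1 for F(x, w) = |x - w|
    step = diameter / (lipschitz * math.sqrt(n))
    x, best = x0, f_m(x0, sample)
    for _ in range(n):
        x = min(max(x - step * subgradient(x, sample), lo), hi)  # step, then project onto X
        best = min(best, f_m(x, sample))
    return best

rng = random.Random(1)
sample = [rng.uniform(-2.0, 2.0) for _ in range(101)]
median = sorted(sample)[50]
f_star = f_m(min(max(median, -1.0), 1.0), sample)   # exact f_m^* on X (projected median)
gaps = {n: projected_subgradient(sample, n) - f_star for n in (10, 100, 1000)}
```

Under these assumptions the standard analysis bounds the optimization error of the best iterate by $D_X / n^{1/2}$, which has the form $K_m(\bar\omega)/n^p$ with $p = 1/2$.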

In view of Assumption 3 and the optimality of $f_m^*(\bar\omega)$ for $\mathbf{P}_m(\bar\omega)$,

$$f_m^*(\bar\omega) - f^* \le f_m(A_m^n(x, \bar\omega), \bar\omega) - f^* \le f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p, \qquad (1)$$

for every $n, m \in \mathbb{N}$, $x \in X$, and almost every $\bar\omega \in \bar\Omega$. It follows that $\mathrm{MSE}(f_m(A_m^n(x, \bar\omega), \bar\omega)) \le
\max\{\mathrm{MSE}(f_m^*(\bar\omega)), \mathrm{MSE}(f_m^*(\bar\omega) + K_m(\bar\omega)/n^p)\}$. This suggests that a good allocation policy
should balance the sampling error $f_m^*(\bar\omega) - f^*$, which contributes to both maximands and
decays at rate $m^{-1/2}$, and the bias term $K_m(\bar\omega)/n^p$ due to the optimization. More precisely,
since under Assumption 2 increasing $n$ and $m$ are equally computationally costly, we would
like to select an asymptotically admissible allocation policy $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ such that
$m(c)^{-1/2} \sim n(c)^{-p}$. This discussion is formalized in the next theorem, where to simplify the
notation we write $n$ and $m$ instead of $n(c)$ and $m(c)$, respectively.


Theorem 1 Suppose that Assumptions 1, 2, and 3 hold, $x \in X$, $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ is an
asymptotically admissible allocation policy, and $n(c)/c^{1/(2p+1)} \to a$, with $a \in (0, \infty)$, as
$c \to \infty$. Then

$$c^{p/(2p+1)}(f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*) = O_p(1).$$


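Proof. By assumption, $m \to \infty$, as $c \to \infty$, and hence, in view of Proposition 1,
$m^{1/2}(f_m^*(\bar\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y))$, as $c \to \infty$. Let $r(c) = c^{p/(2p+1)}$. In view of
Eq. (1),

$$r(c)(f_m^*(\bar\omega) - f^*) \le r(c)(f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*) \le r(c)(f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p) \qquad (2)$$

for all $c \in \mathbb{N}$ and almost every $\bar\omega \in \bar\Omega$.

Since the policy is asymptotically admissible, $mn/c \to 1$, as $c \to \infty$. Moreover,
$n/c^{1/(2p+1)} \to a \in (0, \infty)$, as $c \to \infty$, by assumption. We find that

$$\frac{r(c)}{m^{1/2}} = \left(\frac{c}{nm}\right)^{1/2} \left(\frac{n}{c^{1/(2p+1)}}\right)^{1/2} \to a^{1/2},$$

as $c \to \infty$. Similarly,

$$\frac{r(c)}{n^p} = \left(\frac{c^{1/(2p+1)}}{n}\right)^{p} \to a^{-p},$$

as $c \to \infty$.

Then, by a converging together argument (p. 27 of [6]),

$$r(c)(f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p) \Rightarrow K a^{-p} + a^{1/2} \inf_{y \in X^*} N(0, \sigma^2(y)), \qquad (3)$$

as $c \to \infty$, where $K$ is as in Assumption 3, and

$$r(c)(f_m^*(\bar\omega) - f^*) \Rightarrow a^{1/2} \inf_{y \in X^*} N(0, \sigma^2(y)), \qquad (4)$$

as $c \to \infty$. From Eq. (2),

$$r(c)|f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*| \le \max\{r(c)|f_m^*(\bar\omega) - f^*|, \ r(c)|f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p|\}$$

for almost every $\bar\omega \in \bar\Omega$. Hence, in view of Eqs. (3) and (4), for any $\epsilon > 0$ there exists a
constant $C > 0$ such that

$$\bar{\mathbb{P}}(r(c)|f_m^*(\bar\omega) - f^*| > C) < \frac{\epsilon}{2}$$

and

$$\bar{\mathbb{P}}(r(c)|f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p| > C) < \frac{\epsilon}{2}$$

for all sufficiently large $c$. Therefore,

$$\bar{\mathbb{P}}(r(c)|f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*| > C)$$
$$\le \bar{\mathbb{P}}(\max\{r(c)|f_m^*(\bar\omega) - f^*|, \ r(c)|f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p|\} > C)$$
$$\le \bar{\mathbb{P}}(r(c)|f_m^*(\bar\omega) - f^*| > C) + \bar{\mathbb{P}}(r(c)|f_m^*(\bar\omega) - f^* + K_m(\bar\omega)/n^p| > C)$$
$$< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$$

for $c$ sufficiently large. □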

We see from Theorem 1 that for any finite $p > 0$, the convergence rate of $f_{m(c)}(A_{m(c)}^{n(c)}(x, \bar\omega), \bar\omega)$
is worse than the canonical rate $c^{-1/2}$ when only sampling is considered; see Proposition 1.
Hence, $c^{1/2 - p/(2p+1)} = c^{1/(2(2p+1))}$ is the "cost of optimization." Of course, if $p$ is large,
that cost is moderate. However, if $p = 1/2$ as for the subgradient method, then the cost of
optimization is $c^{1/4}$ and the convergence rate is of order $c^{-1/4}$.

It follows from the proof of Theorem 1 that if $n$ grows slower or faster than $c^{1/(2p+1)}$ then
the convergence rate is slower than $c^{-p/(2p+1)}$. Indeed, it is easy to see that $n \sim c^{1/(2p+1)+\epsilon}$,
for $\epsilon > 0$, results in a convergence rate of order $c^{-p/(2p+1)+\epsilon/2}$, while $n \sim c^{1/(2p+1)-\epsilon}$, for
$0 < \epsilon < 1/(2p+1)$, leads to a convergence rate of order $c^{-p/(2p+1)+p\epsilon}$. This lends support to
our statement that $n \sim c^{1/(2p+1)}$ is the optimal allocation rule. Conveniently, however, the
rate of convergence is robust to the choice of $a$.
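
The exponent arithmetic in the sublinear case can be reproduced with a short Python sketch. It assumes the heuristic used in the discussion above, namely that the error is governed by the slower of the sampling term $m^{-1/2}$ and the optimization term $n^{-p}$ when $n \sim c^{\beta}$ and $m \sim c^{1-\beta}$. The function names and the integer rounding in `sublinear_policy` are hypothetical choices made for this example.

```python
from fractions import Fraction

def sublinear_policy(c, p, a=1.0):
    """Allocation n ~ a * c^{1/(2p+1)}, m = floor(c / n), rounded to integers."""
    n = max(1, round(a * c ** (1.0 / (2.0 * p + 1.0))))
    m = max(1, c // n)
    return n, m

def rate_exponent(p, eps=0):
    """Exponent r such that the error is of order c^{-r} when
    n ~ c^{1/(2p+1) + eps}; eps may be negative."""
    p, eps = Fraction(p), Fraction(eps)
    beta = 1 / (2 * p + 1) + eps      # n ~ c^beta, m ~ c^(1 - beta)
    sampling = (1 - beta) / 2         # m^{-1/2} = c^{-(1 - beta)/2}
    optimization = p * beta           # n^{-p}  = c^{-p * beta}
    return min(sampling, optimization)

best = rate_exponent(Fraction(1, 2))                       # subgradient method, p = 1/2
too_many_iters = rate_exponent(Fraction(1, 2), Fraction(1, 10))
too_few_iters = rate_exponent(Fraction(1, 2), Fraction(-1, 10))
```

With $p = 1/2$ the balanced choice gives exponent $1/4$, and perturbing the growth of $n$ in either direction by $\epsilon = 1/10$ lowers it to $1/4 - \epsilon/2$ or $1/4 - p\epsilon$, both equal to $1/5$ here.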


4  Linearly  Convergent  Optimization  Algorithm 

Suppose that the deterministic optimization algorithm used to solve $\mathbf{P}_m(\bar\omega)$ converges linearly
with a rate of convergence coefficient independent of $\bar\omega$ and $m$ as stated in the next
assumption.

Assumption 4 There exists a $\theta \in (0,1)$ such that

$$f_m(A_m^n(x, \bar\omega), \bar\omega) - f_m^*(\bar\omega) \le \theta\,(f_m(A_m^{n-1}(x, \bar\omega), \bar\omega) - f_m^*(\bar\omega))$$

for all $x \in X$, $n, m \in \mathbb{N}$, and almost every $\bar\omega \in \bar\Omega$.
Assumption 4 is satisfied by many gradient methods such as the steepest descent method
and projected gradient method when applied to $\mathbf{P}_m(\bar\omega)$ under the assumption that $F(\cdot, \omega)$ is
strongly convex and twice continuously differentiable for almost every $\omega \in \Omega$ and that $X$ is
convex. Moreover, the requirement in Assumption 4 that the rate of convergence coefficient
$\theta$ holds "uniformly" for almost every $\bar\omega \in \bar\Omega$ follows when the largest and smallest eigenvalues
of the Hessian of $F(\cdot, \omega)$ are bounded from above and below, respectively, with bounds
independent of $\omega$. The requirement of a twice continuously differentiable random function
excludes at first sight two-stage stochastic programs with recourse [15], conditional Value-at-Risk
minimization problems [28], inventory control problems [32], complex engineering design
problems [27], and similar problems involving a nonsmooth random function. However,
these  nonsmooth  functions  can  sometimes  be  approximated  with  high  accuracy  by  smooth 
functions  [1,  32,  29].  Hence,  the  results  of  this  section  as  well  as  the  next  one,  dealing  with 
superlinearly  convergent  optimization  algorithms,  may  also  be  applicable  in  such  contexts. 

In view of Assumption 4, we find that

$$f_m^*(\bar\omega) - f^* \le f_m(A_m^n(x, \bar\omega), \bar\omega) - f^* \le f_m^*(\bar\omega) - f^* + \theta^n(f_m(x, \bar\omega) - f_m^*(\bar\omega)) \qquad (5)$$

for every $n, m \in \mathbb{N}$ and almost every $\bar\omega \in \bar\Omega$. As in the sublinear case, a judicious approach
to ensure that $\mathrm{MSE}(f_m(A_m^n(x, \bar\omega), \bar\omega))$ decays at the fastest possible rate is to equalize the
sampling and optimization error decay rates. Since the first term decreases at a rate $m^{-1/2}$
and the second term at a rate $\theta^n$, an allocation with $n(c) \sim \log c$ and $m(c) \sim c/\log c$ meets
this criterion. The following theorem makes this argument rigorous.

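Theorem 2 Suppose that Assumptions 1, 2, and 4 hold, $x \in X$, $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ is an
asymptotically admissible allocation policy, and $n(c) - a \log c \to 0$, with $a > 0$, as $c \to \infty$.

(i) If $a \ge (2\log(1/\theta))^{-1}$, then, as $c \to \infty$,

$$\left(\frac{c}{a \log c}\right)^{1/2} (f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)).$$

(ii) If $0 < a < (2\log(1/\theta))^{-1}$, then

$$c^{a \log(\theta^{-1})} (f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*) = O_p(1).$$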

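Proof. Since the policy is asymptotically admissible, $mn/c \to 1$, and also $m \to \infty$, as
$c \to \infty$. This fact and the Central Limit Theorem imply that $m^{1/2}(f_m(x, \bar\omega) - f(x)) \Rightarrow
N(0, \sigma^2(x))$, as $c \to \infty$. Proposition 1 results in $m^{1/2}(f_m^*(\bar\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y))$, as
$c \to \infty$.

For any $r : \mathbb{N} \to \mathbb{R}_+$,

$$r(c)(f_m^*(\bar\omega) + \theta^n(f_m(x, \bar\omega) - f_m^*(\bar\omega)) - f^*)$$
$$= r(c)\left[(f_m^*(\bar\omega) - f^*) + \theta^n(f_m(x, \bar\omega) - f(x) + f^* - f_m^*(\bar\omega)) + \theta^n(f(x) - f^*)\right] \qquad (6)$$

for every $\bar\omega \in \bar\Omega$.

First we consider part (i) and let $r(c) = (c/(a \log c))^{1/2}$, with $a \ge (2\log(1/\theta))^{-1}$. By the
assumption on $a$,

$$r(c)\theta^n = r(c) e^{(n - a\log c)\log\theta} e^{a \log c \log\theta} = \left(\frac{1}{a \log c}\right)^{1/2} c^{1/2 - a\log(\theta^{-1})} e^{(n - a\log c)\log\theta} \to 0,$$

as $c \to \infty$, so that $r(c)\theta^n/m^{1/2} \to 0$. Analogously,

$$\frac{r(c)}{m^{1/2}} = r(c)\left(\frac{c}{mn}\right)^{1/2}\left(\frac{n}{c}\right)^{1/2} = \left(\frac{c}{mn}\right)^{1/2}\left(\frac{n - a\log c}{a \log c} + 1\right)^{1/2} \to 1,$$

as $c \to \infty$.

Then, by Eq. (6) and a converging together argument (p. 27 of [6]),

$$r(c)(f_m^*(\bar\omega) - f^* + \theta^n(f_m(x, \bar\omega) - f_m^*(\bar\omega))) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)),$$

and

$$r(c)(f_m^*(\bar\omega) - f^*) = \frac{r(c)}{m^{1/2}}\, m^{1/2}(f_m^*(\bar\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)),$$

as $c \to \infty$. In view of Eq. (5), the conclusion of part (i) follows.

Second, we consider part (ii) and let $r(c) = c^{a\log(\theta^{-1})}$, with $0 < a < (2\log(1/\theta))^{-1}$. Then,

$$r(c)\theta^n = r(c) e^{(n - a\log c)\log\theta} e^{a \log c \log\theta} = e^{(n - a\log c)\log\theta} \to 1,$$

as $c \to \infty$. Also,

$$\frac{r(c)}{m^{1/2}} = r(c)\left(\frac{c}{nm}\right)^{1/2}\left(\frac{n}{c}\right)^{1/2} = c^{a\log(\theta^{-1}) - 1/2}\left(\frac{c}{nm}\right)^{1/2}(a\log c)^{1/2}\left(\frac{n - a\log c}{a \log c} + 1\right)^{1/2} \to 0,$$

as $c \to \infty$. Consequently,

$$\frac{r(c)\theta^n}{m^{1/2}} \to 0,$$

as $c \to \infty$. By Eq. (6) and a converging together argument,

$$r(c)(f_m^*(\bar\omega) + \theta^n(f_m(x, \bar\omega) - f_m^*(\bar\omega)) - f^*) \Rightarrow f(x) - f^*, \qquad (7)$$

and

$$r(c)(f_m^*(\bar\omega) - f^*) \Rightarrow 0, \qquad (8)$$

as $c \to \infty$. From Eq. (5),

$$|f_m(A_m^n(x, \bar\omega), \bar\omega) - f^*| \le \max\{|f_m^*(\bar\omega) - f^*|, \ |f_m^*(\bar\omega) + \theta^n(f_m(x, \bar\omega) - f_m^*(\bar\omega)) - f^*|\}$$

for almost every $\bar\omega \in \bar\Omega$. From here on the argument follows the last part of the proof of
Theorem 1, and is omitted.

□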

In view of Theorem 2, we see that if $a \ge (2\log(1/\theta))^{-1}$, then $f_{m(c)}(A_{m(c)}^{n(c)}(x, \bar\omega), \bar\omega)$ tends
to $f^*$ at a rate $(c/\log c)^{-1/2}$, which is slower than the canonical rate $c^{-1/2}$ in the case of
sampling only; see Proposition 1. Hence $(\log c)^{1/2}$ can be viewed as the cost of optimization.
We note that the best choice of $a$ is $(2\log(1/\theta))^{-1}$ and that the convergence rate worsens
significantly when $a < (2\log(1/\theta))^{-1}$. Specifically, for $a = \delta(2\log(1/\theta))^{-1}$, with $\delta \in (0,1)$,
the rate is $c^{-\delta/2}$, which is slower than $(c/\log c)^{-1/2}$ for all $c$ sufficiently large.
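
A minimal Python sketch of the allocation suggested by Theorem 2 follows. The function names are hypothetical, and the rounding is an assumption of this example: since $n$ must be an integer, the condition $n(c) - a\log c \to 0$ cannot hold literally, and the sketch rounds $a \log c$ upward so that the number of iterations is never below the target.

```python
import math

def critical_a(theta):
    """Threshold a = (2 log(1/theta))^{-1} from Theorem 2."""
    return 1.0 / (2.0 * math.log(1.0 / theta))

def linear_policy(c, theta, delta=1.0):
    """Allocation n ~ a log c with a = delta * critical_a(theta),
    rounded up, and m = floor(c / n)."""
    a = delta * critical_a(theta)
    n = max(1, math.ceil(a * math.log(c)))
    m = max(1, c // n)
    return n, m

def rate_exponent_small_a(theta, delta):
    """For 0 < delta < 1, Theorem 2(ii) gives an error of order
    c^{-a log(1/theta)} with a = delta * critical_a(theta), i.e. c^{-delta/2}."""
    return delta * critical_a(theta) * math.log(1.0 / theta)

n, m = linear_policy(10**6, theta=0.5)          # few iterations, large sample
exponent = rate_exponent_small_a(0.9, 0.5)      # delta = 1/2 gives c^{-1/4}
```

The example shows how unevenly the budget is split in the linear case: with $c = 10^6$ and $\theta = 0.5$, only about ten iterations are used and nearly all of the budget goes into the sample size.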

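If $X^* = \{x^*\}$ is a singleton and $a = (2\log(1/\theta))^{-1}$, then

$$\left(\frac{c}{\log c}\right)^{1/2} (f_{m(c)}(A_{m(c)}^{n(c)}(x, \bar\omega), \bar\omega) - f^*) \Rightarrow N(0, (-2\log\theta)^{-1}\sigma^2(x^*)),$$

as $c \to \infty$, which can be used to construct a confidence interval for $f^*$ if $\theta$ is known and $\sigma^2(x^*)$
can be estimated.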
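
The limit above suggests an approximate confidence interval of the form estimate $\pm\, z\,((-2\log\theta)^{-1}\hat\sigma^2 \log c / c)^{1/2}$. The Python sketch below implements this formula under the stated conditions (singleton $X^*$, $a$ at the critical value, $\theta$ known). The function name, its arguments, and the plug-in variance estimate `sigma2_hat` are assumptions of this example, and the interval is valid only asymptotically.

```python
import math
from statistics import NormalDist

def confidence_interval(estimate, sigma2_hat, theta, c, level=0.95):
    """Approximate two-sided interval for f* based on the limit
    (c / log c)^{1/2} (estimate - f*) => N(0, sigma2 / (-2 log theta))."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)           # standard normal quantile
    asymptotic_var = sigma2_hat / (-2.0 * math.log(theta))
    half_width = z * math.sqrt(asymptotic_var * math.log(c) / c)
    return estimate - half_width, estimate + half_width

# Hypothetical inputs: estimate 1.0, variance estimate 4.0, theta = 0.5, c = 10^6.
lower, upper = confidence_interval(1.0, 4.0, theta=0.5, c=10**6)
```

The half-width shrinks like $(\log c / c)^{1/2}$, slightly slower than the $c^{-1/2}$ that would apply if the budget were spent on sampling alone.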

Often the rate of convergence coefficient, $\theta$, of the optimization algorithm is theoretically
known as in the case of the steepest descent and projected gradient methods with
Armijo step size rule (see Section 6) and/or it can be accurately estimated from preliminary
calculations using the optimization algorithm; see [26]. If the theoretical value of $\theta$
is excessively conservative relative to the actual progress made by the algorithm or preliminary
calculations are impractical or unreliable, then it may be problematic to use the
best allocation policy recommended by Theorem 2, i.e., selecting $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ such that
$n(c) - (2\log(1/\theta))^{-1}\log c \to 0$ and $n(c)m(c)/c \to 1$, as $c \to \infty$. A slight underestimation
of $\theta$ would result in a substantially slower rate as indicated by part (ii) of that theorem. In
such a situation, it may be prudent to select a more conservative allocation policy that satisfies
$n(c) \sim c^{\nu}$ for $0 < \nu < 1$, which guarantees the same convergence rate regardless of
the value of $\theta$; this is the same approach followed in a different context in [19]. The rate
is worse than the optimal one of Theorem 2, but better than what can occur with a poor
estimate of $\theta$. This conservative asymptotically admissible allocation policy is discussed in the
next proposition.

Proposition 2 Suppose that Assumptions 1, 2, and 4 hold, $x \in X$, $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ is an
asymptotically admissible allocation policy, and $n(c)/c^{\nu} \to a > 0$, with $0 < \nu < 1$, as $c \to \infty$.
Then,

$$\frac{c^{(1-\nu)/2}}{a^{1/2}}\big(f_m(A_m^n(x,\omega),\omega) - f^*\big) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)),$$

as $c \to \infty$.
Proof. Let $r(c) = c^{(1-\nu)/2}/a^{1/2}$. Then,

$$\frac{r(c)}{m^{1/2}} = r(c)\left(\frac{c}{mn}\right)^{1/2}\left(\frac{n}{c}\right)^{1/2} \to 1, \qquad (9)$$

and

$$r(c)\theta^n = \frac{c^{(1-\nu)/2}\theta^n}{a^{1/2}} \to 0, \qquad (10)$$

as $c \to \infty$. From these results it follows that

$$\frac{r(c)}{m^{1/2}}\theta^n \to 0, \qquad (11)$$

as $c \to \infty$. Also, Proposition 1, Eq. (9), and a converging together argument show that

$$r(c)\big(f_m^*(\omega) - f^*\big) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)).$$

Proposition 1, the Central Limit Theorem, Eqs. (9)-(11), and a converging together argument
show that

$$r(c)\theta^n\big(f_m(x,\omega) - f(x) + f^* - f_m^*(\omega)\big) \Rightarrow 0,$$

and

$$r(c)\theta^n\big(f(x) - f^*\big) \Rightarrow 0,$$

as $c \to \infty$. From this point on the proof resembles that of Theorem 2, and we omit the
details. □
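
To make the trade-off between the tuned and the conservative policies concrete, the following Python sketch evaluates the error scales implied by Theorem 2 and Proposition 2 at a given budget. The function names and the example parameter values are hypothetical; the scales ignore the unknown variance constants and are meant only for rough comparison.

```python
import math

def scale_conservative(c, nu, a=1.0):
    """Error scale a^{1/2} c^{-(1-nu)/2} under n(c) ~ a*c**nu (Proposition 2)."""
    return math.sqrt(a) * c ** (-(1.0 - nu) / 2.0)

def scale_tuned(c, theta):
    """Error scale (a log c / c)^{1/2} with a = 1/(2 log(1/theta)) (Theorem 2, part (i))."""
    a = 1.0 / (2.0 * math.log(1.0 / theta))
    return math.sqrt(a * math.log(c) / c)

def scale_misspecified(c, delta):
    """Error scale c^{-delta/2} when a = delta/(2 log(1/theta)) with 0 < delta < 1."""
    return c ** (-delta / 2.0)

c = 10 ** 8
print(scale_tuned(c, theta=0.56))
print(scale_conservative(c, nu=0.2))
print(scale_misspecified(c, delta=0.5))
```

With these illustrative values the conservative policy sits between the tuned policy and the misspecified one; it beats a misspecified logarithmic policy only when $1 - \nu > \delta$.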


5  Superlinearly  Convergent  Optimization  Algorithm 

In this section, we assume that the optimization algorithm used to solve $P_m(\omega)$ is superlinearly
convergent as defined by the next assumption.

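Assumption 5 There exists a $\theta \in (0, \infty)$ and a $\psi \in (1, \infty)$ such that

$$f_m(A_m^{n+1}(x,\omega),\omega) - f_m^*(\omega) \le \theta\big(f_m(A_m^n(x,\omega),\omega) - f_m^*(\omega)\big)^{\psi}$$

for all $x \in X$, $m, n \in \mathbb{N}$, and almost every $\omega \in \Omega$.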

Assumption 5 holds for Newton’s method with $\psi = 2$ when it is applied to $P_m(\omega)$ with $F(\cdot,\omega)$
being strongly convex and twice Lipschitz continuously differentiable for almost every $\omega \in \Omega$,
when the starting point $x \in X$ is sufficiently close to the global minimizer of $P_m(\omega)$, and when the
Hessian of $F(\cdot,\omega)$ and its Lipschitz constant are bounded in some sense as $\omega$ ranges over
$\Omega$. Cases with $\psi \in (1, 2)$ arise, for example, in Newtonian methods with infrequent Hessian
updates.
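For any $\omega \in \Omega$, it follows by induction from Assumption 5 that

$$f_m^*(\omega) - f^* \le f_m(A_m^n(x,\omega),\omega) - f^* \le f_m^*(\omega) - f^* + \theta^{-1/(\psi-1)}\big(\theta^{1/(\psi-1)}(f_m(x,\omega) - f_m^*(\omega))\big)^{\psi^n}. \qquad (12)$$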
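
The induction behind Eq. (12) can be written out as follows. The shorthand $e_n$ for the optimization error after $n$ iterations is introduced here only for illustration and does not appear elsewhere in the paper.

```latex
% Shorthand (illustrative): e_n = f_m(A_m^n(x,\omega),\omega) - f_m^*(\omega) \ge 0,
% so that Assumption 5 reads e_{n} \le \theta e_{n-1}^{\psi}.
\begin{align*}
e_n &\le \theta e_{n-1}^{\psi}
     \le \theta\,(\theta e_{n-2}^{\psi})^{\psi}
     = \theta^{1+\psi} e_{n-2}^{\psi^2}
     \le \cdots
     \le \theta^{1+\psi+\cdots+\psi^{n-1}} e_0^{\psi^n} \\
    &= \theta^{(\psi^n-1)/(\psi-1)} e_0^{\psi^n}
     = \theta^{-1/(\psi-1)} \bigl(\theta^{1/(\psi-1)} e_0\bigr)^{\psi^n}.
\end{align*}
% Adding f_m^*(\omega) - f^* to both sides, with e_0 = f_m(x,\omega) - f_m^*(\omega),
% gives the upper bound in Eq. (12); the lower bound holds because e_n \ge 0.
```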

Similar to the discussion of the past sections, in order to guarantee that $\mathrm{MSE}(f_m(A_m^n(x,\omega),\omega))$
decreases at the fastest possible rate we equalize the sampling error rate (of order $m^{-1/2}$)
with the optimization error decay rate (of order $(\theta^{1/(\psi-1)}(f_m(x,\omega) - f_m^*(\omega)))^{\psi^n}$, as long as
the element within parentheses is smaller than 1). Equalizing these rates suggests an allocation
policy with $n(c) \sim \log\log c$. The formal result is stated next, where $\kappa(x) =
\log(\theta^{-1/(\psi-1)}(f(x) - f^*)^{-1})$, which is positive for $\theta^{1/(\psi-1)}(f(x) - f^*) < 1$.
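
The following Python sketch shows how the quantities in this discussion might be computed in practice: $\kappa(x)$ from $\theta$, $\psi$, and the initial optimality gap, and an allocation with $n(c) \approx a\log\log c$. The function names, the default $a = 1/\log\psi$, and the integer rounding are assumptions for illustration; the gap $f(x) - f^*$ is generally unknown and would have to be estimated.

```python
import math

def kappa(theta, psi, gap):
    """kappa(x) = log(theta**(-1/(psi-1)) / gap), where gap = f(x) - f* > 0."""
    return -math.log(theta) / (psi - 1.0) - math.log(gap)

def superlinear_allocation(c, psi, a=None):
    """Return (n, m) with n ~ a*log(log(c)) and n*m <= c; default a = 1/log(psi)."""
    if a is None:
        a = 1.0 / math.log(psi)
    n = max(1, round(a * math.log(math.log(c))))  # rounding is an illustrative choice
    m = max(1, int(c // n))
    return n, m

# Example: quadratic convergence (psi = 2) and a budget of one million.
print(superlinear_allocation(10 ** 6, psi=2.0))
print(kappa(theta=1.0, psi=2.0, gap=0.5))
```

Because $\log\log c$ grows so slowly, only a handful of iterations are allocated even at large budgets, and nearly all of the budget goes to sampling.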


Theorem 3 Suppose that Assumptions 1, 2, and 5 hold, $x \in X$, $\{(n(c), m(c))\}_{c \in \mathbb{N}}$ is an
asymptotically admissible allocation policy, and that $n(c) - a\log\log c = O((\log c)^{-a\log\psi})$, with
$a > 0$. Moreover, suppose that $x$ is such that $\theta^{1/(\psi-1)}(f(x) - f^*) < 1$, where $\theta$ and $\psi$ are as
in Assumption 5.

(i) If $a > 1/\log\psi$ or if $a = 1/\log\psi$ and $\kappa(x) \ge 1/2$ then, as $c \to \infty$,

$$\left(\frac{c}{a\log\log c}\right)^{1/2}\big(f_m(A_m^n(x,\omega),\omega) - f^*\big) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)). \qquad (13)$$

(ii) If $a < 1/\log\psi$ or if $a = 1/\log\psi$ and $\kappa(x) < 1/2$, then

$$\exp\big(\kappa(x)(\log c)^{a\log\psi}\big)\big(f_m(A_m^n(x,\omega),\omega) - f^*\big) = O_p(1). \qquad (14)$$

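Proof. Let $x \in X$ be such that $\theta^{1/(\psi-1)}(f(x) - f^*) < 1$. For any $r : \mathbb{N} \to \mathbb{R}_+$, we define

$$B_1(c) = \theta^{1/(\psi-1)} r(c)^{\psi^{-n}}\big(f_m(x,\omega) - f(x)\big),$$

$$B_2(c) = \theta^{1/(\psi-1)} r(c)^{\psi^{-n}}\big(f^* - f_m^*(\omega)\big),$$

$$b_3(c) = \theta^{1/(\psi-1)} r(c)^{\psi^{-n}}\big(f(x) - f^*\big).$$

Then,

$$r(c)\Big[f_m^*(\omega) - f^* + \theta^{-1/(\psi-1)}\big(\theta^{1/(\psi-1)}(f_m(x,\omega) - f_m^*(\omega))\big)^{\psi^n}\Big] = r(c)\big(f_m^*(\omega) - f^*\big) + \theta^{-1/(\psi-1)}\big(B_1(c) + B_2(c) + b_3(c)\big)^{\psi^n}. \qquad (15)$$

By the Mean Value Theorem, $\exp((a\log\log c - n)\log\psi) = 1 + ((a\log\log c - n)\log\psi)\exp(\xi_c)$,
where $\xi_c$ lies between $(a\log\log c - n)\log\psi$ and 0. By assumption, $\sup_c\{\xi_c\} < \infty$, so that
$\exp((a\log\log c - n)\log\psi) = 1 + O((\log c)^{-a\log\psi})$. Therefore,

$$\begin{aligned}
\log r(c)^{\psi^{-n}} &= \psi^{-n}\log r(c) \\
&= \exp((a\log\log c - n)\log\psi)\exp(-a\log\psi\,\log\log c)\log r(c) \\
&= \big(1 + O((\log c)^{-a\log\psi})\big)(\log c)^{-a\log\psi}\log r(c) \\
&= (\log c)^{-a\log\psi}\log r(c) + O\big((\log c)^{-2a\log\psi}\log r(c)\big). \qquad (16)
\end{aligned}$$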



We first consider part (i), where $a > 1/\log\psi$ or $a = 1/\log\psi$ and $\kappa(x) \ge 1/2$. Let

$$r(c) = \left(\frac{c}{a\log\log c}\right)^{1/2}.$$

In view of Proposition 1 and the fact that

$$r(c)m^{-1/2} = \left(\frac{c}{nm}\left(\frac{n - a\log\log c}{a\log\log c} + 1\right)\right)^{1/2} \to 1 \qquad (17)$$

as $c \to \infty$, we obtain by a converging together argument that

$$r(c)\big(f_m^*(\omega) - f^*\big) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y)), \qquad (18)$$


as $c \to \infty$. By Eqs. (12), (15), and (18) as well as a sandwich argument, the proof of part
(i) will be complete once we show that $(B_1(c) + B_2(c) + b_3(c))^{\psi^n} \Rightarrow 0$, or, what is the same,
that for an arbitrary $\epsilon > 0$, $\mathbb{P}(|B_1(c) + B_2(c) + b_3(c)|^{\psi^n} > \epsilon) \to 0$ as $c \to \infty$. To wit,

$$\mathbb{P}\big(|B_1(c) + B_2(c) + b_3(c)|^{\psi^n} > \epsilon\big) = \mathbb{P}\big(B_1(c) + B_2(c) + b_3(c) > \epsilon^{\psi^{-n}}\big) + \mathbb{P}\big(B_1(c) + B_2(c) + b_3(c) < -\epsilon^{\psi^{-n}}\big). \qquad (19)$$


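From Eq. (16) it follows that

$$\begin{aligned}
\log r(c)^{\psi^{-n}} &= (\log c)^{-a\log\psi}\,\tfrac{1}{2}\big(\log c - \log(a\log\log c)\big) + O\big((\log c)^{1-2a\log\psi}\big) \\
&= \tfrac{1}{2}\big((\log c)^{1-a\log\psi} - (\log c)^{-a\log\psi}\log\log\log c\big) + O(1/\log c). \qquad (20)
\end{aligned}$$

We observe that $\log r(c)^{\psi^{-n}} = O(1)$, so that $r(c)^{\psi^{-n}}/m^{1/2} \to 0$. Knowing that $m^{1/2}(f_m^*(\omega) - f^*) \Rightarrow \inf_{y \in X^*} N(0, \sigma^2(y))$
and $m^{1/2}(f_m(x,\omega) - f(x)) \Rightarrow N(0, \sigma^2(x))$, a converging together
argument results in $B_1(c) \Rightarrow 0$ and $B_2(c) \Rightarrow 0$.

Looking at the $b_3(c)$ term, we obtain using Eq. (20) that

$$\log b_3(c) = -\kappa(x) + \tfrac{1}{2}\big((\log c)^{1-a\log\psi} - (\log c)^{-a\log\psi}\log\log\log c\big) + O(1/\log c). \qquad (21)$$

Therefore, if $a > 1/\log\psi$ then $\log b_3(c) \to -\kappa(x)$; i.e., $b_3(c) \to \theta^{1/(\psi-1)}(f(x) - f^*) < 1$. The
fact that $\epsilon^{\psi^{-n}} \to 1$ as $c \to \infty$ and a converging together argument then lead to

$$\frac{B_1(c) + B_2(c) + b_3(c)}{\epsilon^{\psi^{-n}}} \Rightarrow \theta^{1/(\psi-1)}\big(f(x) - f^*\big).$$

Since $0 < \theta^{1/(\psi-1)}(f(x) - f^*) < 1$, the latter implies that the r.h.s. of Eq. (19) converges to
0 as $c \to \infty$.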
If $a = 1/\log\psi$ we get from (21) that

$$\log b_3(c) = -\kappa(x) + \frac{1}{2} - \frac{1}{2}\frac{\log\log\log c}{\log c} + O(1/\log c). \qquad (22)$$

Hence, if $\kappa(x) > 1/2$ we get via a converging together argument that the r.h.s. of Eq. (19)
converges to 0. When $\kappa(x) = 1/2$, we need to treat the $\epsilon^{\psi^{-n}}$ term more carefully. Proceeding
as in Eq. (16) we get that $\log\epsilon^{\psi^{-n}} = \log(\epsilon)/\log c + O((\log c)^{-2})$. Hence, Eqs. (16) and (22)
yield

$$\log\left(\frac{b_3(c)}{\epsilon^{\psi^{-n}}}\right) = -\frac{1}{2}\frac{\log\log\log c}{\log c} + O(1/\log c),$$

meaning that $b_3(c)/\epsilon^{\psi^{-n}} < 1$ eventually, so the r.h.s. of Eq. (19) converges to 0 as $c \to \infty$.
Next we consider part (ii), where $a < 1/\log\psi$ or $a = 1/\log\psi$ and $\kappa(x) < 1/2$, and

$$r(c) = \exp\big(\kappa(x)(\log c)^{a\log\psi}\big).$$

Then

$$\frac{r(c)}{m^{1/2}} \to 0, \qquad (23)$$

as $c \to \infty$. Thus, in view of Proposition 1 and a converging together argument we get that

$$r(c)\big(f_m^*(\omega) - f^*\big) \Rightarrow 0, \qquad (24)$$

as $c \to \infty$.

Also, since $\exp(\psi^{-n}\kappa(x)(\log c)^{a\log\psi}) \le \exp(\kappa(x)(\log c)^{a\log\psi})$, Eq. (23) results in

$$\frac{r(c)^{\psi^{-n}}}{m^{1/2}} = \frac{\exp\big(\psi^{-n}\kappa(x)(\log c)^{a\log\psi}\big)}{m^{1/2}} \to 0,$$

as $c \to \infty$. The Central Limit Theorem and a converging together argument result in
$B_1(c) \Rightarrow 0$, as $c \to \infty$. Just the same, Proposition 1 and a converging together argument
show that $B_2(c) \Rightarrow 0$, as $c \to \infty$.

Regarding the $b_3(c)$ term, we obtain from Eq. (16) that

$$\begin{aligned}
\log b_3(c) &= -\kappa(x) + (\log c)^{-a\log\psi}\log r(c) + O\big((\log c)^{-2a\log\psi}\log r(c)\big) \\
&= -\kappa(x) + (\log c)^{-a\log\psi}\kappa(x)(\log c)^{a\log\psi} + O\big((\log c)^{-a\log\psi}\big) \\
&= O\big((\log c)^{-a\log\psi}\big).
\end{aligned}$$

Also, by an argument similar to the one leading to Eq. (16), we find that for any $\epsilon > 0$

$$\log\epsilon^{\psi^{-n}} = \log\epsilon\,(\log c)^{-a\log\psi} + O\big((\log c)^{-2a\log\psi}\big).$$

Hence, there is a finite $\epsilon$ such that for all $c$ sufficiently large, $b_3(c) < \epsilon^{\psi^{-n}}$, meaning that
$(B_1(c) + B_2(c) + b_3(c))^{\psi^n} = O_p(1)$. In conclusion, $r(c)(f_m(A_m^n(x,\omega),\omega) - f^*)$ is bounded
below by an $o_p(1)$ term (cf. Eq. (24)), and above by an $O_p(1)$ term. The second statement
of the theorem now follows using an argument similar to the one employed in the last part
of the proof of Theorem 1. □

We observe from Theorem 3 that $a$ should be selected as $1/\log\psi$ to obtain the most
favorable coefficient in the rate expression, assuming that the initial solution $x \in X$ is sufficiently
close to the optimal solution of $P$. In this case, the convergence rate is essentially
the canonical $c^{-1/2}$, only slightly reduced by a $\log\log c$ term. Hence, in the case of a
superlinearly convergent optimization algorithm, the cost of optimization is essentially negligible.
If $a$ is selected smaller, the convergence rate may deteriorate, decaying at best at
rate $c^{-\kappa(x)} > c^{-1/2}$, as in Theorem 2.

We see from the above result that $a$ must be chosen larger when $\psi$ approaches one to
maintain the best rate of convergence. Consequently, $n$ must also be chosen larger. Intuitively,
as $\psi$ tends to one, we expect that the above result tends to the one for the linear case.
For example, suppose that $\psi$ is a function of $c$. Specifically, let $\psi = \exp((\log\log c)/\log c)$,
which tends to 1 as $c \to \infty$, and $\theta \in (0,1)$. Then, if $a = \log c/\log\log c$, we obtain that
$a\log\psi = 1$ and $\kappa(x) > 1/2$ for all $c$ sufficiently large. Hence, we obtain that

$$\left(\frac{c}{a\log\log c}\right)^{1/2} = \left(\frac{c}{\log c}\right)^{1/2},$$

which shows that the rate of Theorem 3, part (i), tends to the rate of Theorem 2 with $a = 1$,
assuming that $\theta < \exp(-1/2)$. The rate is obtained in both cases using $n = \log c$.
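
The identities used in this limiting argument are easy to check numerically. The short Python sketch below, with a hypothetical helper name, evaluates $a\log\psi$ and $a\log\log c$ for the choices $\psi = \exp((\log\log c)/\log c)$ and $a = \log c/\log\log c$.

```python
import math

def limiting_case(c):
    """Return (a*log(psi), a*log(log(c)), log(c)) for psi = exp(loglog c / log c)."""
    log_c = math.log(c)
    loglog_c = math.log(log_c)
    psi = math.exp(loglog_c / log_c)   # order of convergence, tends to 1 as c grows
    a = log_c / loglog_c               # policy coefficient in n(c) = a*log(log(c))
    return a * math.log(psi), a * loglog_c, log_c

for c in (10 ** 3, 10 ** 6, 10 ** 9):
    print(c, limiting_case(c))
```

The first entry equals one and the second equals $\log c$ up to rounding error, so the normalizing factor $(c/(a\log\log c))^{1/2}$ coincides with $(c/\log c)^{1/2}$.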

6  Numerical  Examples 

We  illustrate  the  above  results  using  two  problem  instances.  We  solve  the  first  problem 
instance,  which  arises  in  the  optimization  of  an  investment  portfolio,  using  the  sublinearly 
convergent  subgradient  method  to  illustrate  the  results  of  Section  3.  The  second  problem 
instance  is  randomly  generated  and  we  solve  it  using  the  linearly  convergent  steepest  descent 
method,  to  illustrate  Section  4,  as  well  as  the  quadratically  convergent  Newton’s  method, 
which  relates  to  Section  5.  We  describe  the  problem  instances  and  the  corresponding  nu¬ 
merical  results  in  turn,  with  Subsections  6.1,  6.2,  and  6.3  illustrating  the  sublinear,  linear, 
and  superlinear  cases,  respectively.  We  implement  the  problem  instances  and  optimization 
algorithms  in  Matlab  Version  7.9  and  run  the  calculations  on  a  laptop  computer  with  2.26 
GHz  processor,  3.5  GB  RAM,  and  Windows  XP  operating  system. 


6.1  Subgradient  Method 


The first problem instance is taken from [18] and considers $d - 1$ financial instruments
with random returns given by the $(d-1)$-dimensional random vector $\omega = R + Qu$, where
$R = (R_1, R_2, ..., R_{d-1})'$, with $R_i$ being the expected return of instrument $i$, $Q$ is a $(d-1)$-by-$(d-1)$
matrix, and $u$ is a standard normal $(d-1)$-dimensional random vector. As in [18], we
randomly generate $R$ using an independent sample from a uniform distribution on [0.9, 1.2]
and $Q$ using an independent sample from a uniform distribution on [0, 0.1]. We would like to
distribute one unit of wealth across the $d - 1$ instruments such that the Conditional Value-at-Risk
of the portfolio return is minimized and the expected portfolio return is no smaller
than 1.05. We let $x_i \in \mathbb{R}$ denote the amount of investment in instrument $i$, $i = 1, 2, ..., d-1$.
This results in the random function (see [18, 28])

$$F(x,\omega) = x_d + \frac{1}{1-t}\max\left\{-\sum_{i=1}^{d-1}\omega_i x_i - x_d,\ 0\right\}, \qquad (26)$$

where $x = (x_1, x_2, ..., x_d)'$, with $x_d \in \mathbb{R}$ being an auxiliary decision variable, and $t \in (0,1)$ is
a probability level. The feasible region is

$$X = \left\{x \in \mathbb{R}^d \ \middle|\ \sum_{i=1}^{d-1} x_i = 1,\ \sum_{i=1}^{d-1} R_i x_i \ge 1.05,\ x_i \ge 0,\ i = 1, 2, ..., d-1\right\}.$$

We use $d = 101$ and $t = 0.9$. The random function in Eq. (26) is not continuously differentiable
everywhere for $\mathbb{P}$-almost every $\omega \in \Omega$. However, the function possesses a subdifferential
and we consequently use the subgradient method with fixed step size $(n+1)^{-1/2}$,
where $n$ is the number of iterations. This step size is optimal in the sense of Nesterov; see
[23], pp. 142-143. As stated in Section 3, the subgradient method satisfies Assumption 3
with $p = 1/2$ and $K_m(\omega) = D_X \sum_{j=1}^m C(\omega_j)/m$, where $C(\omega)$ is as in Assumption 1 and
$D_X = \max_{x,x' \in X}\|x - x'\|$. Of course, as pointed out in [18], this problem instance can be reformulated
as a conic-quadratic programming problem and solved directly without the use of
sampling. Hence, this is a convenient test instance as we are able to compute $f^* = -0.352604$
(rounded to six digits) using cvx [11]. We use initial solution $x = (0, 0, ..., 0, 1, 0, 0, ..., 0, -1)'$,
where the 65-th component equals 1. In our data, the 65-th instrument has the largest
expected return. Hence, the initial solution is the one with the largest expected portfolio
return.
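
For readers who wish to experiment with this instance, the following Python sketch evaluates the random function of Eq. (26), in the standard Conditional Value-at-Risk form, and its sample average. It is not the Matlab implementation used for the reported results; the function names and the list-based data layout are assumptions for illustration.

```python
import random

def cvar_integrand(x, omega, t=0.9):
    """F(x, omega) = x_d + max(-sum_i omega_i*x_i - x_d, 0) / (1 - t)."""
    *weights, x_d = x                      # last entry is the auxiliary variable x_d
    loss = -sum(w * r for w, r in zip(weights, omega))
    return x_d + max(loss - x_d, 0.0) / (1.0 - t)

def sample_average(x, samples, t=0.9):
    """f_m(x): average of F(x, omega_j) over the given return scenarios."""
    return sum(cvar_integrand(x, om, t) for om in samples) / len(samples)

def draw_returns(R, Q, rng):
    """One scenario omega = R + Q u with u a standard normal vector."""
    k = len(R)
    u = [rng.gauss(0.0, 1.0) for _ in range(k)]
    return [R[i] + sum(Q[i][j] * u[j] for j in range(k)) for i in range(k)]
```

A subgradient of $F(\cdot,\omega)$ at $x$ follows directly from the `max` term, which is what makes the subgradient method applicable here.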

Figure 1 illustrates Theorem 1 and displays $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega))$ for the subgradient
method as a function of computing budget $c$ on a logarithmic scale for the allocation
policy $n(c) = ac^{1/2}$, with $a$ = 20, 10, 1, 0.1, and 0.05 as well as $c = 10^3, 10^4, 10^5, 10^6,
10^7$, and $10^8$. Here and below, the MSE is estimated using 30 independent replications of
$f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega)$. We see that the slopes of the lines in Figure 1 are approximately $-1/2$,
which corresponds to a rate of convergence of order $c^{-1/2}$ for the MSE as predicted in Theorem 1
for $p = 1/2$. While the rate of convergence in Theorem 1 is independent of $a$, we
do observe some sensitivity to $a$ numerically. We find that the MSE initially decreases as $a$
decreases until $a = 1$. As $a$ becomes less than one, the picture is less clear and there appears
to be little benefit from reducing $a$ further. An examination of the proof of Theorem 1 shows
that such a behavior can be expected as both the upper and lower bounds on the estimator
depend on $a$; see (1), (3), and (4).

Figure 1: Estimates of $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega))$ for the subgradient method when applied
to the first problem instance as a function of computing budget $c$ on a logarithmic scale for
the policy $n(c) = ac^{1/2}$, with $a$ = 20, 10, 1, 0.1, and 0.05.
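
The slopes discussed above can be estimated from replicated runs by an ordinary least-squares fit on the log-log scale. The Python sketch below shows one way to do this; the function names are hypothetical and the synthetic MSE values in the usage example are made up solely to exercise the code.

```python
import math

def mse(estimates, f_star):
    """Mean squared error of replicated estimates around the known optimal value."""
    return sum((e - f_star) ** 2 for e in estimates) / len(estimates)

def loglog_slope(budgets, mses):
    """Least-squares slope of log10(MSE) against log10(c)."""
    xs = [math.log10(c) for c in budgets]
    ys = [math.log10(v) for v in mses]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Synthetic check: an MSE proportional to c^{-1/2} gives a slope of -1/2.
budgets = [10 ** k for k in range(3, 9)]
synthetic = [5.0 * c ** -0.5 for c in budgets]
print(loglog_slope(budgets, synthetic))
```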

6.2  Steepest  Descent  Method 

The second problem instance uses

$$F(x,\omega) = \sum_{i=1}^{20} a_i(x_i - b_i\omega_i)^2, \qquad (27)$$

where $b_i$ is randomly generated from a uniform distribution on $[-1,1]$ and $a_i$ is randomly generated
from a uniform distribution on $[1,2]$, $i = 1, 2, ..., 20$. The vector $\omega = (\omega_1, ..., \omega_{20})'$
consists of 20 independent and $[0,1]$-uniformly distributed random variables. This problem
instance is strongly convex with a unique global minimizer $x^* = (x_1^*, ..., x_{20}^*)'$, where
$x_i^* = b_i/2$. The optimal value is $\sum_{i=1}^{20} a_i b_i^2/12 = 0.730706$ (rounded to six digits). We use
initial solution $x = 0$.

Figure 2: Estimates of $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega))$ for the steepest descent method when
applied to the second problem instance as a function of computing budget $c$ on a logarithmic
scale for the policy $n(c) = a\log c$, with $a$ = 2, 1, and 0.5 as well as the policy $n(c) = c^{\nu}$, with
$\nu$ = 0.8, 0.4, and 0.2.
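
The closed-form minimizer and optimal value follow from $E[\omega_i] = 1/2$ and $\mathrm{Var}(\omega_i) = 1/12$ for a $[0,1]$-uniform random variable. The Python sketch below verifies this on freshly generated coefficients; the seed and the resulting $a_i$, $b_i$ are arbitrary and are not the data behind the value 0.730706 reported above.

```python
import random

def expected_objective(x, a, b):
    """f(x) = E[sum_i a_i (x_i - b_i w_i)^2] for w_i ~ U[0, 1], in closed form."""
    # E[(x - b w)^2] = (x - b/2)^2 + b^2/12 since E[w] = 1/2 and Var(w) = 1/12.
    return sum(ai * ((xi - bi / 2.0) ** 2 + bi ** 2 / 12.0)
               for xi, ai, bi in zip(x, a, b))

def minimizer_and_value(a, b):
    """Return x* with x*_i = b_i/2 and f* = sum_i a_i b_i^2 / 12."""
    x_star = [bi / 2.0 for bi in b]
    f_star = sum(ai * bi ** 2 / 12.0 for ai, bi in zip(a, b))
    return x_star, f_star

rng = random.Random(0)                       # seed chosen arbitrarily
a = [rng.uniform(1.0, 2.0) for _ in range(20)]
b = [rng.uniform(-1.0, 1.0) for _ in range(20)]
x_star, f_star = minimizer_and_value(a, b)
print(f_star, expected_objective([0.0] * 20, a, b))
```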

The random function in this problem instance is continuously differentiable for all $\omega \in \Omega$
and we adopt the steepest descent method with Armijo step size rule as the optimization
algorithm. This algorithm has at least a linear rate of convergence with rate of convergence
coefficient $\theta = 1 - 4\lambda_{\min}\alpha(1-\alpha)\beta/\lambda_{\max}$, where $\alpha, \beta \in (0,1)$ are Armijo step size parameters.
We use $\alpha = 0.5$ and $\beta = 0.8$. Moreover, $\lambda_{\min}$ and $\lambda_{\max}$ are lower and upper bounds on the
smallest and largest eigenvalues, respectively, of $\nabla^2 f(x)$ on $\mathbb{R}^d$. In this problem instance,
we obtain that $\lambda_{\min} = 1.094895$ and $\lambda_{\max} = 1.991890$. Hence, the steepest descent method
with Armijo step size rule satisfies Assumption 4 with $\theta = 0.56$.

Figure 2 illustrates Theorem 2 and displays $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega))$ for the steepest
descent method when applied to the second problem instance as a function of computing
budget $c$ using the policy $n(c) = a\log c$, with $a$ = 2, 1, and 0.5 (marked with circles) and
the alternative policy $n(c) = c^{\nu}$, with $\nu$ = 0.8, 0.4, and 0.2 (marked with squares). As in
the previous subsection, we consider $c = 10^3, 10^4, 10^5, 10^6, 10^7, 10^8$. The lines in Figure 2
marked with circles have slope quite close to $-1$, i.e., the MSE decays with rate of order $c^{-1}$,
which is as predicted in Theorem 2. We see that the MSE decreases for smaller $a$. For the
largest values of $c$ examined, it appears that $a = 1$ has a slightly larger rate of decay than in
the case of $a = 2$ and $a = 0.5$. These empirical results are in close correspondence with the
asymptotic results of Theorem 2, which predict an improving rate of decay for decreasing
$a$ for $a > (2\log(1/\theta))^{-1}$, and a worsening of the rate for $a < (2\log(1/\theta))^{-1}$. In view of the
above value of $\theta$, $(2\log(1/\theta))^{-1} = 0.86$.
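
The two numerical values quoted for this instance can be reproduced from the stated parameters. The following Python sketch, with a hypothetical function name, evaluates the rate of convergence coefficient and the critical policy coefficient $(2\log(1/\theta))^{-1}$.

```python
import math

def armijo_rate_coefficient(lam_min, lam_max, alpha, beta):
    """theta = 1 - 4*lam_min*alpha*(1 - alpha)*beta/lam_max."""
    return 1.0 - 4.0 * lam_min * alpha * (1.0 - alpha) * beta / lam_max

theta = armijo_rate_coefficient(1.094895, 1.991890, alpha=0.5, beta=0.8)
a_critical = 1.0 / (2.0 * math.log(1.0 / theta))

# Rounded to two digits, these agree with the values 0.56 and 0.86 in the text.
print(round(theta, 2), round(a_critical, 2))
```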

In the case of the alternative policy $n(c) = c^{\nu}$ (see Proposition 2), we see from the lines
marked with squares in Figure 2 that the rate of decay of the MSE improves as $\nu$ decreases.
However, the rate remains worse than the ones obtained using the policy $n(c) = a\log c$.
These observations are consistent with the asymptotic results of Proposition 2 and Theorem
2: The policy $n(c) = c^{\nu}$ improves as $\nu$ tends to zero, but remains inferior to the policy
$n(c) = a\log c$, with $a$ sufficiently large. However, as this alternative policy is independent of
$\theta$, it may be easier to use in practice.

6.3  Newton’s  Method 

We illustrate Theorem 3 by applying Newton’s method with Armijo step size to the second
problem instance defined in the previous subsection. On this problem instance, Newton’s
method has quadratic rate of convergence and satisfies Assumption 5, with $\psi = 2$ and some
$\theta \in (0, \infty)$, when the initial solution $x \in \mathbb{R}^d$ is sufficiently close to the global minimizer of
$P_m(\omega)$. We use the same parameters in the Armijo step size rule and the same initial solution as in the
previous subsection.

Figure 3 presents similar results as in Figures 1 and 2 and considers the policy $n(c) =
a\log\log c$, with $a$ = 3, 2, 1.4, and 1. We again see that the slopes of the lines are close to $-1$,
which is as predicted by Theorem 3. We see that the MSE decreases as $a$ decreases from 3
to 1.4. However, the rates of decay are quite comparable. When $a = 1$, the MSE worsens.
These empirical results are aligned with the asymptotic results of Theorem 3, which stipulate
an improving rate of decay for decreasing $a$ for $a > 1/\log\psi$. The quadratic convergence of
Newton’s method implies that $\psi = 2$ and consequently that the critical value $1/\log\psi$ is
approximately 1.4. Moreover, Theorem 3 predicts a worsening rate of decay for $a < 1/\log\psi$
as observed empirically.

Comparing Figures 1, 2, and 3, we see that the MSE decreases and the rate of decay of
the MSE increases as faster optimization algorithms are utilized. The improvement is most
significant when moving from a sublinearly to a linearly convergent optimization algorithm.
These results are reasonable as a faster optimization algorithm allows for fewer iterations and
a larger sample size as compared to a slower optimization algorithm. The improvement is
only slight when moving from a linearly to a superlinearly convergent optimization algorithm
as it only allows a decrease in number of iterations from a value proportional to $\log c$ to a
value proportional to $\log\log c$.

Figure 3: Estimates of $\mathrm{MSE}(f_{m(c)}(A_{m(c)}^{n(c)}(x,\omega),\omega))$ for Newton’s method when applied to
the second problem instance as a function of computing budget $c$ on a logarithmic scale for
the policy $n(c) = a\log\log c$, with $a$ = 3, 2, 1.4, and 1.
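
To give a sense of scale for this comparison, the Python sketch below tabulates the number of iterations prescribed by the three policies over the budgets used in the experiments. Setting the policy coefficient to one in each case and the dictionary labels are simplifying assumptions for illustration.

```python
import math

def iterations(c):
    """Iteration counts n(c) under the three policies, with coefficient a = 1."""
    return {
        "sublinear: c**0.5": math.sqrt(c),
        "linear: log c": math.log(c),
        "superlinear: log log c": math.log(math.log(c)),
    }

for k in range(3, 9):
    c = 10 ** k
    counts = iterations(c)
    print(c, {name: round(v, 1) for name, v in counts.items()})
```

The drop from $c^{1/2}$ to $\log c$ frees a large share of the budget for sampling, whereas the further drop to $\log\log c$ saves only a few iterations.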

7  Conclusions 

In this paper we characterize optimal computing budget allocation policies in the sample
average approximation approach for solving stochastic programs. We find that in the case
of a sublinearly convergent optimization algorithm for solving the sample average problem
with rate of convergence of order $n^{-p}$, where $n$ is the number of iterations and $p$ is an
algorithm specific parameter, the best achievable convergence rate is of order $c^{-p/(2p+1)}$. In
the case of a linearly convergent optimization algorithm with rate of convergence of order
$\theta^n$ for some parameter $\theta \in (0,1)$, the best overall convergence rate is of order $(c/\log c)^{-1/2}$.
For a superlinearly convergent optimization algorithm with rate of convergence of order
$\theta^{\psi^n}$, where $\theta > 0$ and $\psi \in (1, \infty)$, the best convergence rate is of order $(c/\log\log c)^{-1/2}$.
These rates are only obtained using particular policies for the selection of sample sizes and
number of optimization iterations as identified in the paper. The policies depend on $p$ in
the sublinear case, on $\theta$ in the linear case, and on $\theta$ and $\psi$ in the superlinear case. Other
policies for the selection of sample size and number of iterations may result in substantially worse
rates of decay as quantified in the paper. These results provide detailed insight into the
challenging task of computing budget allocation within the sample average approximation
approach and may spur further research into the development of efficient sampling-based
algorithms for stochastic optimization.